Week 2 Computer Lab Worksheet

General linear models

Author

Chris Moreh

Aims

This session introduces simple and multiple linear regression models. You will be working with data from Österman (2021) to replicate parts of their analysis. We will be covering only basic regression methods in this session, so the article will serve mainly as a broad background to the data here. We will be returning to this article in future weeks too, expanding our modelling strategy as we discover new methods. We will also practice some data management tasks and the basics of data visualisation using principles from ‘the grammar of graphics’ as implemented in the {ggplot2} package (see Kieran Healy’s Data Visualization: A practical introduction for an introduction with many practical examples).

By the end of the session, you will:

  • learn how to import data from foreign formats (e.g. SPSS, Stata, CSV)
  • know how to perform basic descriptive statistics on a dataset
  • understand the basics of data visualisation
  • know how to fit linear regression models in R using different functions
  • learn a few options for presenting findings from regression models

Setup

In Week 1 you set up R and RStudio, and an RProject folder (we called it “HSS8005_labs”) with an .R script and a .qmd or .Rmd document in it (we called these “Lab_1”). Ideally, you saved this on a cloud drive so you can access it from any computer (e.g. OneDrive). You will be working in this folder. If it’s missing, complete Task 2 from the Week 1 Lab.

You can create a new .R script and .qmd/.Rmd for this week’s work (e.g. “Lab_2”). Start working in the .R script initially, then switch to .qmd/.Rmd later in the session to report on your final analysis.

Importing data

As we have seen in Week 1, small datasets that are included in R packages (including base R) for demonstration purposes can be used by simply invoking the name of the dataset. For example, the command head(mtcars) prints out the first six rows (cases) of the “mtcars” dataset included in base R (more specifically, in its “datasets” package):

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
Tip

The data() function lists all the built-in datasets. You can further specify package = to list datasets included in a given package; e.g.:

data(package = "dplyr")

# Note that {dplyr} is one of the data management packages included in the core {tidyverse}. Make sure the {tidyverse} is installed.

We can also import built-in datasets to our Environment in order to inspect them manually:

mtcars_data <- mtcars

If we want to access a dataset from a non-base-R package, we need to ensure that the package is installed on our system and that we specify the name of the package; e.g.:

head(starwars) # gives an error if {dplyr} has not been loaded; here it works because the {tidyverse} was loaded
# A tibble: 6 × 14
  name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
  <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
1 Luke Skywal…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
2 C-3PO           167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
3 R2-D2            96    32 <NA>    white,… red        33   none  mascu… Naboo  
4 Darth Vader     202   136 none    white   yellow     41.9 male  mascu… Tatooi…
5 Leia Organa     150    49 brown   light   brown      19   fema… femin… Aldera…
6 Owen Lars       178   120 brown,… light   blue       52   male  mascu… Tatooi…
# … with 4 more variables: species <chr>, films <list>, vehicles <list>,
#   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
#   ³​eye_color, ⁴​birth_year, ⁵​homeworld
head(dplyr::starwars)
# A tibble: 6 × 14
  name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
  <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
1 Luke Skywal…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
2 C-3PO           167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
3 R2-D2            96    32 <NA>    white,… red        33   none  mascu… Naboo  
4 Darth Vader     202   136 none    white   yellow     41.9 male  mascu… Tatooi…
5 Leia Organa     150    49 brown   light   brown      19   fema… femin… Aldera…
6 Owen Lars       178   120 brown,… light   blue       52   male  mascu… Tatooi…
# … with 4 more variables: species <chr>, films <list>, vehicles <list>,
#   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
#   ³​eye_color, ⁴​birth_year, ⁵​homeworld
  # works, as long as {dplyr} (or the whole {tidyverse}) is installed

Real-life datasets, however, need to be imported into R. Datasets come in various formats. R’s native data format has the extension .rds and can be imported with the readRDS() function. The counterpart function saveRDS() exports a dataset to the .rds format. The core-tidyverse {readr} package has similar functions (read_rds() / write_rds()).

The .rds format is useful because it can be compressed to various sizes to take up less space, but it can only be opened directly in R. It is much more common to encounter data saved in a “delimited” text format, which can easily be opened in a spreadsheet viewer/editor such as Excel. This makes the format highly interoperable and therefore very common. The most widely used is probably the “comma-separated values” (.csv) format, which can be imported with the base-R function read.csv() or the tidyverse readr::read_csv() equivalent. Read Chapter 11 in R4DS for more on these functions.
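For example, a minimal sketch of both import routes (the file names here are hypothetical placeholders for files saved in your project folder):

# R's native format (hypothetical file name):
my_data <- readRDS("data/my_data.rds")

# A comma-separated file, using the {readr} equivalent (hypothetical file name):
my_data_csv <- readr::read_csv("data/my_data.csv")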

Very often, you will need to import data saved in the format of another proprietary statistical analysis package such as SPSS or Stata. Large survey projects usually make data available in these formats. The great advantage of these formats is that they can incorporate additional information about variables and the levels of categorical variables (e.g. value labels, or specific values for different types of missing data). This additional information can be extremely valuable, but it is not handled straightforwardly in text-based formats, spreadsheets or R’s native data format. To make it usable in R, we need a few specially designed functions.

The {haven} package — part of the extended {tidyverse}, meaning that it is installed on your system as part of {tidyverse}, but the library("tidyverse") command does not load it by default; it needs to be loaded explicitly — is one of the most commonly used for this purpose. Functions such as read_sas(), read_sav() and read_dta() import datasets specific to the SAS, SPSS and Stata programs, respectively.

It is highly recommended to read the documentation available for the {haven} package to understand how it operates. Fundamentally, it is designed to import data into an intermediary format which stores the additional labelling information in a way that allows users to access it, but does not make it easy to use directly. A suite of packages developed by Daniel Lüdecke from the University of Hamburg offers additional functionality for working with labels directly when summarising and plotting data. These packages integrate well with the {tidyverse} and are actively maintained, and we will use them in this course to make our lives a bit easier. To install them, run:

# We can install several packages at once by first creating a vector of their names

sj_packages <- c("sjlabelled", "sjPlot", "sjstats", "ggeffects", "sjmisc")

install.packages(sj_packages)

The functions sjlabelled::read_sas(), sjlabelled::read_spss() and sjlabelled::read_stata() are the {sjlabelled} equivalents of the {haven} functions mentioned above. This vignette article included with the package explains the main differences between the two.

As a first step, let’s import the osterman dataset that underpins the Österman (2021) article (see the Data page on the course website for information about the datasets available in this course):

osterman <- sjlabelled::read_stata("https://cgmoreh.github.io/HSS8005-data/osterman.dta")

Using functions learnt in Week 1, do the following (a brief sketch is included after the list):

  1. check the dimensions of the dataset; what does it tell you?
  2. print a summary of the entire dataset; what have you learnt about the data?
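
One possible sketch, using base-R functions covered in Week 1:

dim(osterman)       # number of rows (cases) and columns (variables)

summary(osterman)   # a quick summary of every variable in the dataset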

A very convenient way to create a codeplan for a dataset – especially if it has value-labelled categorical variables – is offered by the sjPlot::view_df() function, which produces a table in HTML format that can be saved and consulted for more information about the variables. With the default settings, we get the following:

sjPlot::view_df(osterman)

The output opens up in the Viewer pane.

There are several additional options that can add useful complexity to the codeplan, as well as the option to restrict it to selected variables. It also works in a “piped” workflow, so it can be combined with {dplyr} verbs such as select to restrict variables beforehand in a more flexible way. Below we request a codeplan with extended options:

sjPlot::view_df(osterman,
                show.na = TRUE, 
                show.type = TRUE, 
                show.frq = TRUE, 
                show.prc = TRUE, 
                show.string.values = TRUE)
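
As a sketch of the piped workflow mentioned above, we could restrict the codeplan to the handful of variables used later in this worksheet (the selection itself is only illustrative):

library(tidyverse)

osterman |>
  select(trustindex3, eduyrs25, agea, female) |>
  sjPlot::view_df()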

The output can also be opened in an external web browser window by clicking on the third (last) icon at the top of the Viewer toolbar. From the browser window, with a Ctrl + Right-click > Save as... we can save the table as an HTML document locally and use it as a reference.

Before moving forward, spend some time examining the codeplan that you have produced and if you haven’t yet had a chance to skim through the Österman (2021) article, have a quick read through their description of the dataset.

Exercise 1: Create some basic descriptive graphs using the ggplot() command from the {ggplot2} tidyverse package for the association between the following variables:

  1. ‘trustindex3’ and ‘eduyrs25’
  2. ‘trustindex3’ and ‘agea’
  3. ‘trustindex3’ and ‘female’

Exercise 2: What factors affect trust?

Fit three simple bivariate OLS regressions, and then a multiple regression, using the lm() function:

  1. Regress ‘trustindex3’ on ‘eduyrs25’ and interpret the results
  2. Regress ‘trustindex3’ on ‘agea’ and interpret the results
  3. Regress ‘trustindex3’ on ‘female’ and interpret the results
  4. Regress ‘trustindex3’ on all three predictors listed above and interpret the results

(Advanced) Exercise 3: Apply the model to a new dataset

The osterman data originates from Waves 1-9 of the European Social Survey. The ESS data are freely accessible upon registration. As part of this exercise, access data from Wave 10 of the survey (from this site: https://www.europeansocialsurvey.org/data/) and perform the following tasks:

  • download the dataset to the Rproject folder
  • select the variables required to recreate the data to fit the multiple regression model from the previous exercise
  • create your version of the ‘trustindex3’ variable
  • fit the models from Exercise 2 and compare the results.

You should already be familiar with the functions needed to complete each of these steps, but some self-study may be required. You will likely need to continue the task outside class. If you succeed in completing the task, we can compare results in Week 3.

Download solutions

SOLUTIONS

Exercise 1: Create some basic descriptive graphs using the ggplot() command from the {ggplot2} tidyverse package for the association between the following variables:

  1. ‘trustindex3’ and ‘eduyrs25’

The best way to approach this problem is by working through the first examples in Kieran Healy’s Data Visualization: A practical introduction, starting at Chapter 3, and applying them to your data. Outside class, you can develop these basic graphs into better looking ones by adding various extra layers. The ggplot() function is part of the ggplot2 package, which is included in the core tidyverse, so we don’t need to load it separately if we have already loaded the tidyverse.

The ggplot approach to graphs is to build them up step-by-step using various layers. The basic template code for a ggplot() graph is the following:

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +  <GEOM_FUNCTION>()

For example, the code below sets up a totally blank canvas:

ggplot()

To start populating the canvas we need to add a first layer containing our dataset and the variables we want to ‘map’ using the aes() argument (short for “aesthetic mapping”). This adds coordinates to the canvas based on the variables we want to graph (in our case, ‘trustindex3’ and ‘eduyrs25’). We are treating ‘trustindex3’ as the outcome variable in these exercises, so we will position it on the y axis:

ggplot(osterman, aes(y = trustindex3, x = eduyrs25))

We now have a basic layer, but no actual data displayed. The final crucial step is to use the + operator to add another layer specifying the type of graph (called a “geom” in ggplot, short for “geometric object”: a bar, a line, a boxplot, a histogram, etc.) that we want to use to represent the relationship between the two variables. In this case, given that both variables are measured on a numeric scale (or at least on an ordinal scale with seven or more categories), the best option is a scatterplot. In ggplot(), the scatterplot “geom” is called with the geom_point() function:

ggplot(osterman, aes(y = trustindex3, x = eduyrs25)) +
  geom_point()

We now have a scatterplot of the relationship between ‘trustindex3’ and ‘eduyrs25’. Because both variables take a limited number of values, many points sit on top of each other and this scatterplot is not very informative.

We can add another “geom” that is better able to summarise the relationship. The function geom_jitter() (a shortcut for geom_point(position = "jitter")) is helpful in such cases because it adds a small amount of random variation to the location of each point, making areas of overlapping points more visible. The first and third commands below are equivalent; the second also keeps the original (un-jittered) points as a separate layer:

ggplot(osterman, aes(y = trustindex3, x = eduyrs25)) +
  geom_point(position = "jitter")

ggplot(osterman, aes(y = trustindex3, x = eduyrs25)) +
  geom_point() + 
  geom_jitter()

ggplot(osterman, aes(y = trustindex3, x = eduyrs25)) +
  geom_jitter()

Another option is the geom_smooth() function, which provides a set of options, basically returning “smooth lines” representing various types of regression lines. The function fits a regression in the background and graphs the results. The default setting is to fit a generalized additive model that captures non-linearities in the data with a smoothing spline (the Wikipedia article on GAMs gives a maths-heavy outline of these models, but they are beyond our interests here). The smooth line produced is probably more informative about the general idea:

ggplot(osterman, aes(y = trustindex3, x = eduyrs25)) +
  geom_jitter() +
  geom_smooth()

But we can also specify other regression methods, and because we are here aiming to model the relationship between trust and education as a linear model, we can specify the method as “lm”:

ggplot(osterman, aes(y = trustindex3, x = eduyrs25)) +
  geom_jitter() +
  geom_smooth(method = "lm")

Now we get a straight regression line, which is basically the visual representation of the bivariate linear regression model that we will fit in Exercise 2 below. There are numerous further specifications that can be added to improve the graph, and Healy’s book is a good resource for ideas that you can play around with. We won’t go into much more detail about these additional options here, but as a taster, let’s say that we want to make the regression line more pronounced by changing its colour to red and make the scatter dots slightly transparent by adjusting the colour’s “alpha” level:

ggplot(osterman, aes(y = trustindex3, x = eduyrs25)) +
  geom_jitter(alpha = 0.1) +
  geom_smooth(method = "lm", colour = "red")

  2. ‘trustindex3’ and ‘agea’

We can now do something similar for the relationship between trust and age (the ‘trustindex3’ and ‘agea’ variables in the dataset). Again, both variables are measured on a numeric scale, so a scatterplot should work best. Because we don’t know what to expect and therefore what additional settings would improve each individual graph, we start from the most basic informative layer and build up from there. To practice some alternative approaches to working with plots, here we will first save the basic plot as a ggplot object, to which we can later add further layers and specifications:

age_plot <- ggplot(osterman, aes(y = trustindex3, x = agea)) +
  geom_point()

If no output was generated from the command above, that’s as expected. The graph was produced, but we didn’t ask for it to be printed; we only asked for it to be saved as an object called “age_plot”. To see it, we can simply call “age_plot”. We can then make various additions to this plot.

age_plot

This looks very similar to the previous graph, so we could add the same additional specifications as in the previous exercise, this time adding them to the plot object that we saved:

age_plot +
  geom_jitter(alpha = 0.1) +
  geom_smooth(method = "lm", colour = "red")

The association between age and trust appears very weak, something that we will explore further in Exercise 2.

  3. ‘trustindex3’ and ‘female’

We can try a similar scatterplot here too, but there may be better options:

ggplot(osterman, aes(y = trustindex3, x = female)) +
  geom_point() +
  geom_jitter(alpha = 0.1) +
  geom_smooth(method = "lm", colour = "red")

This graph can be confusing, and we are better off using another “geom” because the female variable is a dichotomous/binary factor (categorical) variable. A good visualisation tool in this case is a boxplot, which can be called with the geom_boxplot() function. There will be a challenge, though:

ggplot(osterman, aes(y = trustindex3, x = female)) +
  geom_boxplot()

The issue with this graph is that the female variable is recognised as numeric by R. What we need to do first - or as part of the ggplot() function itself - is to tell R to treat ‘female’ as a factor. We could do the following:

ggplot(osterman, aes(y = trustindex3, x = factor(female))) +
  geom_boxplot()

Or as part of a piped workflow:

osterman |> mutate(female = as_factor(female)) |> 
  ggplot(aes(y = trustindex3, x = female)) +
  geom_boxplot()

But it may be useful to change the variable type in the dataset altogether by saving the mutation and then using the changed female variable:

# We are overwriting the original dataset here, so we better not make a mistake:
osterman <- osterman |> mutate(female = as_factor(female))

# And from now on the 'female' variable will be treated as a factor:
ggplot(osterman, aes(y = trustindex3, x = female)) +
  geom_boxplot()

Exercise 2: What factors affect trust?

Fit three simple bivariate OLS regressions, and then a multiple regression, using the lm() function:

  1. Regress ‘trustindex3’ on ‘eduyrs25’ and interpret the results

We will do just that, saving the regression as an object called “m1” (for model 1):

m1 <- lm(trustindex3 ~ eduyrs25, data = osterman)

The model object has now been saved in the Environment, and we can inspect it manually if we want by opening it in the Source window. The object is a large list, with various components that we can call and print separately. The most basic information that we can obtain from the model is the coefficients:

coefficients(m1)
(Intercept)    eduyrs25 
  3.9085904   0.1054227 

This basic information is enough to write out the linear equation underpinning the model:

\[ y_i = b_0 + b_1 x_i \]

The coefficients correspond to the \(b\)’s in this simple model, and we can plug in the estimated values to obtain

\[ trust_i=3.91 + 0.11 \times education_i \]

We find, thus, that the number of years spent in education has a positive association with social trust, with each additional year of education associated with a 0.11-point higher score on the trust index, above the baseline (intercept) of 3.91 points when education equals 0. With this formula we can calculate a predicted trust score for any individual \(i\) from their years of education.
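For example, for someone with 12 years of education the predicted trust score is roughly 3.91 + 0.11 × 12 ≈ 5.17. We can obtain the same prediction with the predict() function (the value of 12 years is just an illustrative choice):

predict(m1, newdata = data.frame(eduyrs25 = 12))
# returns approximately 5.17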

We can also get more information about the model with the summary() function. When applied to a linear model object, it provides the following output:

summary(m1)

Call:
lm(formula = trustindex3 ~ eduyrs25, data = osterman)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.5442 -1.1737  0.1426  1.2992  6.0914 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.908590   0.022630   172.7   <2e-16 ***
eduyrs25    0.105423   0.001692    62.3   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.853 on 68209 degrees of freedom
  (585 observations deleted due to missingness)
Multiple R-squared:  0.05384,   Adjusted R-squared:  0.05383 
F-statistic:  3882 on 1 and 68209 DF,  p-value: < 2.2e-16

This output tells us a lot more about the fitted model: a five-number summary of the residuals, estimates of uncertainty for our coefficients (the standard errors and the p-values associated with t-tests - displayed as Pr(>|t|)), the residual standard error, the R-squared values, and an overall F-test of the model.

While these are informative, the format is not ideal for further manipulation and presentation. Several user-written functions exist to improve on this output. For example, the {broom} package - part of the {tidymodels} suite of packages - has functions to extract model information into “tidy” tibbles (data frames), which can then be further manipulated and plotted. This is especially useful when working with results from many models that benefit from comparison in a standardised format.
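As a brief illustration (assuming {broom} is installed; tidy() and glance() are its standard extractor functions for lm objects):

library(broom)

tidy(m1)     # the coefficient table (estimates, SEs, t- and p-values) as a tibble
glance(m1)   # model-level statistics (R-squared, F-statistic, etc.) as a one-row tibble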

The {sjPlot} package that we used before also has functions to export a publishable-quality table in HTML format:

sjPlot::tab_model(m1)
                   trustindex 3
Predictors         Estimates   CI            p
(Intercept)        3.91        3.86 – 3.95   <0.001
eduyrs25           0.11        0.10 – 0.11   <0.001
Observations       68211
R2 / R2 adjusted   0.054 / 0.054

By default the output table shows 95% confidence intervals (CI) instead of standard errors (SE), which can be easier to interpret (the CI is calculated approximately as Estimate ± 1.96 × Std. Error; you can try it out in the Console by plugging in the numeric values from the summary() output).
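For instance, for the ‘eduyrs25’ coefficient:

0.105423 + c(-1.96, 1.96) * 0.001692
# approximately 0.102 to 0.109, matching the 0.10 – 0.11 interval in the table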

The best approach is to graph the model results and present them in a figure, but that’s not very informative in the case of a simple model with only one predictor, so we can leave it for later.

  2. Regress ‘trustindex3’ on ‘agea’ and interpret the results

We can do as above:

m2 <- lm(trustindex3 ~ agea, data = osterman)

summary(m2)

Call:
lm(formula = trustindex3 ~ agea, data = osterman)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.4185 -1.2439  0.1195  1.4047  4.9126 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.9369533  0.0310778  158.86   <2e-16 ***
agea        0.0060192  0.0005931   10.15   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.905 on 68794 degrees of freedom
Multiple R-squared:  0.001495,  Adjusted R-squared:  0.00148 
F-statistic:   103 on 1 and 68794 DF,  p-value: < 2.2e-16

The coefficient on age is statistically significant but substantively tiny: each additional year of age is associated with only a 0.006-point higher trust score, and the model explains almost none of the variation in trust (R-squared ≈ 0.0015). This confirms the weak association we saw in the scatterplot.

  3. Regress ‘trustindex3’ on ‘female’ and interpret the results

m3 <- lm(trustindex3 ~ female, data = osterman)

summary(m3)

Call:
lm(formula = trustindex3 ~ female, data = osterman)

Residuals:
   Min     1Q Median     3Q    Max 
-5.247 -1.247  0.094  1.419  4.761 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.239336   0.010595 494.509   <2e-16 ***
female1     0.008097   0.014560   0.556    0.578    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.906 on 68794 degrees of freedom
Multiple R-squared:  4.495e-06, Adjusted R-squared:  -1.004e-05 
F-statistic: 0.3093 on 1 and 68794 DF,  p-value: 0.5781

In the bivariate model the coefficient on female is very small (0.008) and not statistically significant (p = 0.578), so on its own gender shows no detectable association with trust.

  4. Regress ‘trustindex3’ on all three predictors listed above and interpret the results

Finally we can fit a multiple linear model with several predictors:

m4 <- lm(trustindex3 ~ eduyrs25 + agea + female, data = osterman)

summary(m4)

Call:
lm(formula = trustindex3 ~ eduyrs25 + agea + female, data = osterman)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.7460 -1.1756  0.1325  1.3123  6.0706 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.9547287  0.0428943  68.884  < 2e-16 ***
eduyrs25    0.1163995  0.0017345  67.108  < 2e-16 ***
agea        0.0155578  0.0005939  26.196  < 2e-16 ***
female1     0.0411570  0.0141507   2.908  0.00363 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.844 on 68207 degrees of freedom
  (585 observations deleted due to missingness)
Multiple R-squared:  0.06336,   Adjusted R-squared:  0.06332 
F-statistic:  1538 on 3 and 68207 DF,  p-value: < 2.2e-16

One interesting finding from Model 4 is how radically the statistical significance of the female variable changes compared to Model 3. The impact of gender is still very weak in real terms: compared to men of similar age and education level, women score 0.04 points higher on the trust scale; but this is a stronger estimated effect than in the simple bivariate model (where \(b_1\) was 0.008), and it is now statistically significant.

We could tabulate these two models in a comparative table to better see the contrast:

sjPlot::tab_model(m3, m4)
                                trustindex 3 (m3)                trustindex 3 (m4)
Predictors                      Estimates  CI             p       Estimates  CI            p
(Intercept)                     5.24       5.22 – 5.26    <0.001  2.95       2.87 – 3.04   <0.001
female: female 1                0.01       -0.02 – 0.04   0.578   0.04       0.01 – 0.07   0.004
eduyrs 25                                                         0.12       0.11 – 0.12   <0.001
Age of respondent, calculated                                     0.02       0.01 – 0.02   <0.001
Observations                    68796                             68211
R2 / R2 adjusted                0.000 / -0.000                    0.063 / 0.063

It’s worth noticing that the number of observations used in the two models is not the same, due to missing values in some variables. We could make the samples comparable by extracting the estimation sample of 68,211 observations used in Model 4 and refitting Model 3 on that sample only:

sample <- m4$model

m3_new <- lm(trustindex3 ~ female, data = sample)

sjPlot::tab_model(m3_new, m4)
                                trustindex 3 (m3_new)             trustindex 3 (m4)
Predictors                      Estimates  CI             p       Estimates  CI            p
(Intercept)                     5.24       5.22 – 5.26    <0.001  2.95       2.87 – 3.04   <0.001
female: female 1                0.01       -0.02 – 0.04   0.582   0.04       0.01 – 0.07   0.004
eduyrs 25                                                         0.12       0.11 – 0.12   <0.001
Age of respondent, calculated                                     0.02       0.01 – 0.02   <0.001
Observations                    68211                             68211
R2 / R2 adjusted                0.000 / -0.000                    0.063 / 0.063

We see that this does not affect the overall picture.

(Advanced) Exercise 3: Apply the model to a new dataset

The osterman data originates from Waves 1-9 of the European Social Survey. The ESS data are freely accessible upon registration. As part of this exercise, access data from Wave 10 of the survey (from this site: https://www.europeansocialsurvey.org/data/) and perform the following tasks:

  • download the dataset to the Rproject folder
  • select the variables required to recreate the data to fit the multiple regression model from the previous exercise
  • create your version of the ‘trustindex3’ variable
  • fit the models from Exercise 2 and compare the results.

You should already be familiar with the functions needed to complete each of these steps, but some self-study may be required. The most important missing piece of information needed to complete this task is Österman’s description of how the trustindex3 scale was computed:

To study generalized social trust, I am following the established approach of using a validated three-item scale (Reeskens and Hooghe 2008; Zmerli and Newton 2008). This scale consists of the classic trust question, an item on whether people try to be fair, and an item on whether people are helpful:
- ‘Generally speaking, would you say that most people can be trusted, or that you can’t be too careful in dealing with people?’
- ‘Do you think that most people would try to take advantage of you if they got the chance, or would they try to be fair?’
- ‘Would you say that most of the time people try to be helpful or that they are mostly looking out for themselves?’
All of the items may be answered on a scale from 0 to 10 (where 10 represents the highest level of trust) and the scale is calculated as the mean of the three items. The three-item scale clearly improves measurement reliability and cross-country validity compared to using a single item, such as the classic trust question. Internal consistency for the three items is reasonably high (Cronbach’s alpha: 0.77). The scale ranges between 0 and 10 with a mean of 5.24 for my sample. See the Supplementary material for additional information on the construction of the social trust scale (Section A.1), as well as for models using the classic single-item measure of trust (Section A.9).
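
A rough sketch of how this computation could be translated into code for the Wave 10 data is given below. The ESS item names ppltrst, pplfair and pplhlp, and the file name ESS10.dta, are assumptions that you should check against the ESS documentation and your downloaded file:

library(tidyverse)

# Import the downloaded Stata file (assumed file name, saved in the project folder):
ess10 <- sjlabelled::read_stata("ESS10.dta")

# Compute the trust scale as the mean of the three items (variable names assumed);
# whether to allow partially missing items (na.rm = TRUE) is a choice to check
# against Österman's Supplementary material:
ess10 <- ess10 |>
  mutate(trustindex3 = rowMeans(across(c(ppltrst, pplfair, pplhlp), as.numeric),
                                na.rm = TRUE))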


<< the end >>


References

David, F. N. 1955. “Studies in the History of Probability and Statistics i. Dicing and Gaming (a Note on the History of Probability).” Biometrika 42 (1/2): 1–15. https://doi.org/10.2307/2333419.
El-Shagi, Makram, and Alexander Jung. 2015. “Have Minutes Helped Markets to Predict the MPC’s Monetary Policy Decisions?” European Journal of Political Economy 39 (September): 222–34. https://doi.org/10.1016/j.ejpoleco.2015.05.004.
Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and Other Stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.
Lord, R. D. 1958. “Studies in the History of Probability and Statistics: VIII. De Morgan and the Statistical Study of Literary Style.” Biometrika 45 (1/2): 282–82. https://doi.org/10.2307/2333072.
McElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second. CRC Texts in Statistical Science. Boca Raton: Taylor and Francis, CRC Press.
Mulvin, Dylan. 2021. Proxies: The Cultural Work of Standing in. Infrastructures Series. Cambridge, Massachusetts: The MIT Press.
Österman, Marcus. 2021. “Can We Trust Education for Fostering Trust? Quasi-experimental Evidence on the Effect of Education and Tracking on Social Trust.” Social Indicators Research 154 (1): 211–33. https://doi.org/10.1007/s11205-020-02529-y.
Senn, Stephen. 2003. “A Conversation with John Nelder.” Statistical Science 18 (1): 118–31. https://doi.org/10.1214/ss/1056397489.